PreTrendsPower: Roth (2022) paper review + silent-failure guard #463
Merged
PR-A of the 3-PR PreTrendsPower methodology review sequence (paper review ->
implementation audit -> R parity goldens). Produces the structured methodology
contract that will inform the PR-B audit of diff_diff/pretrends.py against the
paper's exact equations. The review covers:
- Propositions 1-4 (bias decomposition, sign of bias under a monotone trend,
  variance decomposition, convexity gives variance reduction)
- Power calculation algorithm for the slope gamma_p at target power p
- Bias and coverage calculation framework (analytical via tmvtnorm-equivalent
  truncated-MVN moments; simulation fallback)
- Implementation notes and an 11-item gaps list flagging items the PR-B audit
  will need to resolve (notably: NIS vs Wald acceptance region, violation-type
  extensions beyond linear, R-package version pin for parity)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 covariance block convention: add explicit "post-first" block decomposition
paragraph defining Sigma_11=Var[beta_hat_post], Sigma_22=Var[beta_hat_pre],
Sigma_12=Cov(beta_hat_post, beta_hat_pre). Roth's stacked vector is (pre, post)'
but his proofs partition Var[(beta_hat_post, beta_hat_pre)'] post-first; without
the explicit convention, Sigma_22 references in the propositions were ambiguous.
- P1 plug-in variance dimensional bug: change sigma^2_{tau_hat} = l' Sigma l to
l' Sigma_11 l (since l in R^M and Sigma_11 = Var[beta_hat_post] is the only
block of correct dimensions). Add explicit reminder pointing back to the
convention.
- P1 missing Note/Deviation labels in copy-ready Registry section: inline
**Note (deviation from paper):** for slope-of-best-fit acceptance region,
constant/last_period/custom violation types, and the slope pretest_form
enum value. Mark joint Wald + custom B(Sigma) as paper-supported framework
(Propositions 1, 3, 4 apply) but not separately tabulated. Mark "individual"
(NIS) as the only paper-analyzed form.
- P3 brittle line reference: replace "line 2758" with the heading reference
"## PreTrendsPower stub".
- Algorithm step 2 cleanup: clarify sigma_t = sqrt(Sigma_22[t, t]) under the
new convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
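Under that block convention, the step-2 quantity feeds the per-period NIS bounds directly. A minimal sketch with illustrative numbers (variable names here are not the library's API, and the 2x2 `Sigma_22` is made up) of checking whether a pre-period coefficient draw falls inside the NIS acceptance region:

```python
import numpy as np
from scipy import stats

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)          # ~1.96 at the paper's running default

# Sigma_22 = Var[beta_hat_pre] under the post-first block convention.
Sigma_22 = np.array([[0.04, 0.01],
                     [0.01, 0.09]])
beta_hat_pre = np.array([0.10, -0.25])

sigma_t = np.sqrt(np.diag(Sigma_22))       # step 2: sigma_t = sqrt(Sigma_22[t, t])
in_B_NIS = np.all(np.abs(beta_hat_pre) <= z * sigma_t)
print(in_B_NIS)                            # True: no coefficient is individually significant
```

The same `sigma_t` vector defines the box limits used later in the power calculation.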
- P2 API surface drift: add explicit Status disclaimer at top of "Methodology
  Registry Entry" section flagging the block as
  proposed-replacement-text-not-yet-merged (mirrors the established
  goodman-bacon-2021-review.md:14 pattern). Restructure the Tuning Parameters
  table to add a Status column distinguishing "Current" (matches
  diff_diff/pretrends.py:442-447) from "Proposed" (requires API extension via
  the PR-B audit), so the block is not mistaken for the shipped surface.
- P2 Assumption 1 design-heuristic overclaim: drop the "balanced panel +
  cluster-robust + equicorrelated" design-metadata heuristic. Roth's
  Assumption 1 is a condition on the estimated Sigma, not on the experimental
  design. Replace with the requirement that any sharper sign-of-bias warning
  must come from a direct numerical check of Sigma's structure.
- P3 dead internal reference: remove the parenthetical pointer to
  `feedback_verdict_powered_by_tools.md` (a Claude private-memory artifact,
  not a git-tracked file in this repo).
- P3 source-of-truth split: addressed by the Status disclaimer above, which
  explicitly names the existing PreTrendsPower REGISTRY stub as the sole
  authoritative methodology contract until PR-B lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 Proposition 4 convexity overclaim: R1/R2 fixes added language to the
Alternative-acceptance-regions block, requirements checklist, pretest_form
row, and acceptance_region row that said "Propositions 1, 3, 4 apply to any
B(Sigma)". That overstates Roth: Proposition 1 (conditional mean) and
Proposition 3 (conditional variance) hold for any measurable B, but
Proposition 4 (variance reduction / over-coverage under parallel trends)
begins "Suppose that B(Sigma) is a convex set." Nonconvex custom pretests
lose the variance-reduction guarantee even though the conditional-moment
formulas still hold.
Fixed:
- "Custom user-supplied B(Sigma)" bullet now distinguishes Props 1+3 (any B)
from Prop 4 (requires convex B) and explicitly notes nonconvex pretests
lose the over-coverage guarantee.
- Requirements checklist now reads "joint Wald (convex, Props 1+3+4 all
apply); custom B (Props 1+3 apply to any measurable B, Prop 4 only if B is
convex)".
- `pretest_form` row in tuning parameters: same convexity split.
- `acceptance_region` row: "Propositions 1, 3 apply to any measurable B;
Proposition 4 / variance-reduction guarantee additionally requires B to
be convex".
Joint Wald (line 47) already correctly conditioned on convexity ("apply to
this B since it is convex") and was left as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 nonlinear-trend "deviation" overstatement: walk back the "**Note
(deviation from paper):**" labels on nonlinear violation hypotheses
(constant level shift, last-period jump, custom delta vector). Roth's
Section III ("Practical Recommendations") explicitly endorses power
analyses against hypothesized nonlinear trends through the `pretrends`
package, so the GENERAL nonlinear-power capability is paper-supported;
only the SPECIFIC named API parameterizations are R-package conventions.
Updates: Requirements checklist line and Gaps "Nonlinear violations" +
"Custom delta" bullets reframed to distinguish paper-supported framework
(Section III + Propositions 1+3 applying to any trend) from package-API
parameterizations.
- P1 absolute "UNDER-cover" claim: Roth's wording after Prop 4 is "CIs will
tend to under-cover if the bias is sufficiently large" — the under-coverage
direction is contingent on bias magnitude, not universal. Updated the
Implication paragraph after Proposition 4 to keep the conditionality
("tend to UNDER-cover **if the bias is sufficiently large** ... not
universal").
- P2 "stub" mischaracterization: the live `## PreTrendsPower` entry at
REGISTRY.md:2758-2808 is a populated 50-line block (Wald-test framed),
not a stub. Updated the Status header to describe it accurately as "a
populated 50-line block framed primarily around a joint-Wald pre-trends
test" and clarify that this file is a non-authoritative source audit.
Slope-of-best-fit "deviation from paper" labels (lines 48, 159, 195) kept
as-is: Roth does NOT analyze a slope-t-stat as an acceptance region in
Section I or Section III; the slope t-stat in Table 1 is only an observed
property of surveyed papers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 alpha/1.96 internal inconsistency: the proposed registry block exposed
`alpha` as a parameter but hard-coded 1.96 in B_NIS, the plug-in CI, the
power formula, the K=1 closed-form, and the coverage formulas. Promoted
to REGISTRY this would document the wrong inference rule for any
alpha != 0.05. Generalize all five formulas to z_{1-alpha/2}; keep "=1.96"
parenthetical annotations to ground them at the paper's running default.
- P2 sample vs population identity: "beta_hat_pre, which equals delta_pre
under no anticipation" muddied Roth's stochastic model. Roth Equations
(1)-(2) keep beta_hat (random draw) and beta (mean) distinct; under no
anticipation, beta_pre = delta_pre (population), not beta_hat_pre =
delta_pre. Reworded to "beta_hat_pre, whose mean beta_pre equals
delta_pre under no anticipation" with explicit "random draw is not itself
equal to delta_pre" caveat.
- P2 "11 of 12" overcount: Section I.B of Roth (2022) does not support
"11 of 12 use individual significance" as the count I wrote. The paper
reports: 12 of 12 plot pointwise CIs (visual-NIS inspection possible),
5 of 12 explicitly discuss individual significance, 1 of 12 reports a
joint-significance test, several rely on visual inspection without a
formal criterion. Restated using the paper's own breakdown and framed
NIS as the implicit common denominator rather than a literal explicit-
rule majority.
- P3 paper methodology vs library extension labeling: the Figure-1-style
plotting and HonestDiD-composition checklist items inside the copy-ready
block conflated paper-derived methodology with diff-diff design choices.
Added "**Note (library extension):**" labels to both, distinguishing the
paper-derived numerical content (bias and CI by gamma_p; methodology-
agnostic on how beta_hat/Sigma_hat is produced) from the library
presentation and cross-estimator composition decisions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
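The z_{1-alpha/2} generalization is easy to ground with the K=1 closed form. A hedged sketch (the function name `pretest_power_k1` and the sign convention for the violation mean `mu` are illustrative, not the library's API):

```python
from scipy import stats

def pretest_power_k1(mu, sigma, alpha=0.05):
    """Single-pre-period closed form: probability the pretest REJECTS,
    P(|beta_hat| > z_{1-alpha/2} * sigma) for beta_hat ~ N(mu, sigma^2).
    Generalizes the hard-coded 1.96 to an arbitrary alpha."""
    z = stats.norm.ppf(1 - alpha / 2)       # = 1.96 at alpha = 0.05
    return 1 - (stats.norm.cdf(z - mu / sigma)
                - stats.norm.cdf(-z - mu / sigma))

# Sanity check: with no violation (mu = 0) the rejection probability is alpha
# by construction, for any alpha — not just 0.05.
print(round(pretest_power_k1(0.0, 0.2, alpha=0.10), 4))   # 0.1
```

The same substitution (1.96 -> z_{1-alpha/2}) applies term-by-term to B_NIS, the plug-in CI, the general power formula, and the coverage formulas.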
Per user direction (option A: holistic restructure) after three consecutive
rounds of "paper-derived vs library-extension boundary" P1s (R3 Prop 4
convexity, R4 nonlinear-trend "deviation", R6 analytical/simulation as
paper-required). The recurring class warranted naming the invariant and
fixing it structurally rather than another surface patch.
Holistic fix:
- Renamed the single Requirements-checklist block into two named
subsections inside the copy-ready Registry entry:
### Paper-derived requirements (must satisfy to be faithful to Roth 2022)
### Library design choices (extensions beyond Roth 2022)
- Moved each non-paper item (simulation fallback, method/n_sim knobs,
pretest_form/acceptance_region surface, named non-linear violation
parameterizations, Figure-1 plotting interface, HonestDiD composition)
into the Library Design Choices subsection with explicit framing that
the library may keep, drop, or extend each item via the follow-up audit.
- Reworded the Tuning Parameters table's Status column to a clearer
Source column ("Paper-derived" vs "Library extension") so every row's
origin is unambiguous.
P2 (Table 2 factual error):
- Coverage bullet under "Empirical Findings" said unconditional gamma_{0.8}
range was "53% to 98%". Actual Table 2 range for tau_bar unconditional
under gamma_{0.8} is 14% (Deschenes et al. 2017) to 98% (Lafortune et al.;
Markevich & Zhuravskaya). Fixed.
P3 (transient process / line refs):
- Dropped "PR-B in the 3-PR sequence" language from the Status banner and
Tuning Parameters note.
- Replaced literal "diff_diff/pretrends.py:442-447" with stable symbol
refs (`compute_pretrends_power`, `PreTrendsPowerResults`).
- Replaced "lines 2758-2808" REGISTRY range with a non-line-numbered
description ("populated block framed primarily around a joint-Wald
pre-trends test").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 joint-Wald + tmvtnorm still in paper-derived checklist: the R6
  restructure narrowed the recurring "paper vs library" class but two items
  remained mislabeled — joint Wald (a paper-mentioned alternative used by 1 of
  12 surveyed papers, not the empirical exercise) and the tmvtnorm-specific
  analytical backend (one of several valid numerical paths; the paper requires
  producing the right quantities, not using this specific solver). Moved both
  to "Library design choices (paper-supported alternatives and extensions)"
  with explicit framing that the paper requirement is at the QUANTITY level
  (conditional moments, power, bias, coverage), not at the API or
  numerical-backend level. Updated the subsection heading to "paper-supported
  alternatives and extensions beyond Roth 2022" to make the two categories
  explicit.
- P2 Proposition 4 summary row dropped "if bias is sufficiently large": the
  Key Theorems table row claimed CIs "UNDER-cover under violations"
  unconditionally, contradicting the conditional version stated correctly
  elsewhere in the file. Rewrote the row to limit the variance-reduction
  guarantee to its strict statement and pushed the CI coverage discussion into
  a clearly-conditioned addendum.
- P2 API misattribution: I had said `compute_pretrends_power` exposes
  `alpha, power, violation_type, violation_weights` (which is actually the
  `PreTrendsPower` class signature). The helper's actual signature is
  `compute_pretrends_power(results, M, alpha, target_power, violation_type,
  pre_periods)`. Updated the Status banner and Tuning Parameters note to name
  both surfaces explicitly with their actual signatures, and clarified that
  the follow-up audit will reconcile them with each other and with the
  proposed contract.
- Related P2-adjacent overclaim: line 217 said the library "implements a
  Roth-2022 framework"; line 270 said it "implements computations under
  arbitrary Sigma via Proposition 1". Both overstate what `pretrends.py`
  currently does. Tightened to "Wald-test-based pre-trends MDV/power workflow
  ... computes Wald power/MDV from the pre-period VCV block rather than the
  full arbitrary-Sigma Proposition 1/3/4 conditional-moment computations", and
  reframed the heteroskedastic-Sigma gap as a future capability conditional on
  the Prop 1/3/4 path being added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 verdict was ✅ Looks good (no P0/P1) — addressing remaining polish items.
- P2 internal contradiction on simulation-fallback requirement: the
  "Computational shortcut (footnote 8)" paragraph said the library "should
  support both an analytical truncated-multivariate-normal path AND a
  simulation fallback", contradicting the Library design choices subsection
  which correctly classifies simulation fallback as a library robustness
  choice. Rewrote the paragraph to state the paper-derived requirement at the
  quantity level (compute the correct conditional moments and probabilities)
  and explicitly defer backend choice to the Library design choices
  subsection.
- P2 joint-Wald R-parity overclaim: the Gaps section said "library should
  implement both but test against R `pretrends` for the joint-Wald case
  (Roth's package supports it)." Verified via R-package source/docs that the
  published `pretrends` package exposes only NIS-based surfaces
  (`pretrends()`, `slope_for_power()`, `*_NIS` helpers) — no joint-Wald
  parity target. Rewrote to make explicit that joint Wald is theoretically
  admissible under the propositions but parity would need an independent
  fixture/derivation, not direct R-package parity.
- P3 composition overclaim: the Relation section listed `TwoWayFixedEffects`
  and "any estimator producing an event-study coefficient vector" as
  currently composable, but the shipped `compute_pretrends_power` adapter
  only dispatches `MultiPeriodDiDResults`, `CallawaySantAnnaResults`,
  `SunAbrahamResults` and raises `TypeError` otherwise. Split into "Currently
  composes with" (the three shipped adapter targets) and "theoretical
  compatibility" (extends to TWFE etc., pending follow-up audit adapters).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 CS/SA covariance-source deviation: the R7 edit said current
  `pretrends.py` "computes Wald power/MDV from the pre-period
  variance-covariance block", but only the `MultiPeriodDiDResults` branch
  extracts the full pre-period sub-VCV (pretrends.py:592-601). The CS branch
  (pretrends.py:609-652) and SA branch (pretrends.py:660-687) hard-code
  `vcov = np.diag(ses^2)`, dropping the pre-period correlations the paper's
  Propositions rely on. Added an explicit "Note (deviation in current
  covariance-source)" bullet in the "Relation to Existing diff-diff
  Estimators" section documenting the per-result-type split and
  cross-referencing the conservative-variance framing in REPORTING.md. The
  follow-up audit will decide whether to extend full-sub-VCV extraction to
  CS/SA or keep the diag fallback as a deliberate variance-conservatism
  choice.
- P1 Assumption 1 model-vs-estimated Sigma: the R2 edit said "Assumption 1 is
  a condition on the *estimated covariance matrix* Sigma". Roth's Section
  II.B states Assumption 1 on the *model* covariance Sigma (the population
  VCV in Equation 1). Software can only inspect the *estimated* Sigma_hat, so
  any direct numerical check is a HEURISTIC implementation proxy, not the
  paper's assumption itself. Rewrote the Prop 2 Implementation-use cell to
  (a) name Sigma as the model covariance, (b) acknowledge Sigma_hat as the
  implementation surrogate, (c) require any sharper warning based on
  Sigma_hat be labeled heuristic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 "conservative variance source" framing for diag fallback is wrong:
R9 fix added a Note saying the CS/SA `diag(ses^2)` fallback could be
retained as a "deliberate variance-conservatism choice". Roth's
Propositions 1+3+4 (and the MDV/power computation w' Σ_22^{-1} w)
operate on the full pre-period covariance block Σ_22 — dropping the
off-diagonals is a non-paper approximation, not a paper-supported
numerical option, and the direction of the discrepancy with the full-
Σ_22 calc depends on the sign and magnitude of the dropped
correlations (not guaranteed conservative). Reworded the Note to
describe `diag(ses^2)` explicitly as a non-paper approximation; if the
follow-up audit retains it, an explicit REGISTRY.md Note describing
the approximation + its possible miscalibration is required instead.
- P3 R package version not pinned: R `pretrends` package API claims in
the Joint-Wald and version-pin Gaps bullets were temporally unstable.
Softened the Joint-Wald observation to "as observed at the time of
this review (specific commit not pinned)" with an explicit ask for the
follow-up audit to record the exact revision. Strengthened the
version-pin Gaps bullet to enumerate exactly which surface claims
depend on the unpinned R-package surface (NIS-only, no joint-Wald,
`pretrends()` / `slope_for_power()` / `*_NIS` helpers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
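The "not guaranteed conservative" point is easy to demonstrate numerically. An illustrative check (not library code) showing that the Wald-style quantity w' Σ_22^{-1} w moves in opposite directions under the `diag(ses^2)` fallback depending on the sign of the dropped correlation:

```python
import numpy as np

# Unit-variance 2x2 pre-period block with a single off-diagonal rho;
# the diag fallback corresponds to rho = 0. Numbers are illustrative.
w = np.array([1.0, 1.0])

def wald_quantity(rho):
    sigma_22 = np.array([[1.0, rho], [rho, 1.0]])
    return w @ np.linalg.inv(sigma_22) @ w

diag_value = wald_quantity(0.0)          # the diag(ses^2) fallback value
print(wald_quantity(0.5) < diag_value)   # True: diag OVERstates vs full block
print(wald_quantity(-0.5) > diag_value)  # True: diag UNDERstates vs full block
```

Because the discrepancy flips sign with the dropped correlation, `diag(ses^2)` is an approximation whose miscalibration direction is data-dependent, which is exactly why it cannot be labeled conservative.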
R11 verdict was ✅ Looks good (second consecutive) — addressing remaining
polish items.
- P2 CS/SA covariance-source distinction: R9/R10 Note lumped CS and SA
  together as "even when an event-study VCV is available". That is true for
  CallawaySantAnnaResults (which persists event_study_vcov at
  staggered_results.py:126-128) but NOT for SunAbrahamResults (which does not
  expose an event-study or cohort covariance surface; verified in
  sun_abraham.py:30-88). Split the Note into per-result-type bullets: CS diag
  fallback is a deliberate choice (off-diagonals available); SA diag fallback
  is forced (off-diagonals not persisted on the result object, so
  Roth-faithful off-diagonal support requires upstream surface work first).
- P3 unsourced K<=5 / K>10 cutoffs: the "For K <= 5, analytical methods are
  fast. For K > 10, simulation is preferable" guidance in Computational
  Considerations was a hunch, not benchmarked or paper-derived. Relabeled as
  "tentative heuristic (not benchmarked in this review and not specified by
  the paper)" with explicit ask for the follow-up audit to benchmark or
  replace.
- P3 R-package version pin still implicit: R10 fix softened the Joint Wald
  observation to "as observed at the time of this review" but the reviewer
  wanted a more explicit catch-all. Strengthened the Gaps bullet to enumerate
  every R-package surface claim that needs re-verification at a pinned
  revision (NIS-based surface, no joint-Wald target, named helpers,
  nonlinear-trend support claim) and label them collectively provisional.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R12 verdict was ✅ Looks good (third consecutive) — addressing remaining
polish items.
- P2 "uninformative pretests are unambiguously harmful" overstates II.D:
Roth's Section II.D + Equation (4) say the net publication-screening
effect is ambiguous, with underpowered pretests being "least effective
and potentially harmful" — not unambiguously harmful in every
parameterization. Internally inconsistent with my own L266 summary
("net sign depends on which dominates"). Reworded the Publication-bias
edge-case bullet to match the paper's own framing.
- P3 "adding more pre-periods does NOT help (and can hurt)" overstates
Section I.D: paper only says additional distant pre-periods may not
help / may be uninformative for near-treatment shocks. Softened to
"may not help / may be uninformative" without the active-harm claim.
- P3 gamma_max = "largest |gamma| at which power = 1" is impossible:
under the normal model power approaches 1 only asymptotically, so
no finite gamma_max satisfies power(gamma_max) = 1 exactly. Replaced
with a doubling-expansion + bisection bracketing strategy.
- P3 audit-notes vs registry-candidate boundary unclear: added an
explicit HTML-comment boundary marker just before "## Implementation
Notes", retitled that heading to "(audit notes — NOT registry-
candidate)", and updated the Status banner to call out the boundary
explicitly so tentative heuristics + provisional R-package claims
cannot be misread as part of the proposed REGISTRY replacement text.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
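The doubling-expansion + bisection replacement for the impossible gamma_max definition can be sketched as follows. This is a hedged illustration, not the shipped implementation: `solve_gamma_for_power` is a hypothetical name, and the power curve plugged in below is a made-up single-pre-period example assuming power is increasing in gamma and approaches 1 only asymptotically.

```python
from scipy import stats, optimize

def solve_gamma_for_power(power_fn, target=0.80, tol=1e-8):
    """Find gamma_p with power_fn(gamma_p) = p by doubling the upper
    bracket until the target is enclosed, then bisecting the bracket."""
    hi = 1.0
    while power_fn(hi) < target:     # doubling expansion: no finite gamma
        hi *= 2.0                    # has power exactly 1, but any target
    return optimize.brentq(          # p < 1 is bracketed in finite steps
        lambda g: power_fn(g) - target, 0.0, hi, xtol=tol)

# Illustrative power curve (one pre-period, sigma = 0.2, alpha = 0.05).
sigma, z = 0.2, stats.norm.ppf(0.975)
power = lambda g: 1 - (stats.norm.cdf(z - g / sigma)
                       - stats.norm.cdf(-z - g / sigma))
gamma_80 = solve_gamma_for_power(power, target=0.80)
print(abs(power(gamma_80) - 0.80) < 1e-6)   # True
```

`brentq` is used here as the bracketed root solve; plain bisection over the same bracket would work identically, just slower.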
R13 verdict was ✅ Looks good (4th consecutive) — addressing remaining
polish items.
- P2 Wald vs NIS power-object conflation: the CS/SA covariance-source
Note said "the key MDV/power quantity is w' Σ_22^{-1} w". That scalar
is the current library's Wald / noncentral-χ² object, not Roth's
paper-analyzed NIS power object. Roth's NIS power is the multivariate
box probability P(β̂_pre ∈ B_NIS(Σ)) computed under
β̂_pre ~ N(δ_pre, Σ_22), and Proposition 1's pretest-bias term + Proposition
3's conditional-variance term rely on truncated-MVN moments. Reworded
the Note to separate the two surfaces: Roth's NIS box probability and
the current library's Wald form are BOTH non-conservative under the
diag fallback, but for different reasons, and the audit needs to pick
one of them as the surface to promote.
- P3 scipy.stats over-restriction: replaced "scipy.stats has only the
univariate case" with the accurate distinction —
`scipy.stats.multivariate_normal.cdf` does cover box probabilities,
but SciPy lacks a `tmvtnorm`-equivalent for the truncated-MVN MOMENTS
that Propositions 1 + 3 need.
- P3 discoverability links (sibling updates, follows Bacon precedent at
REGISTRY.md:2611):
* REGISTRY.md `## PreTrendsPower` entry now includes
`Paper review on file: docs/methodology/papers/roth-2022-review.md`
inline with the primary source citation, with explicit note that
REGISTRY remains authoritative until the follow-up audit lands.
* METHODOLOGY_REVIEW.md PreTrendsPower section's "Documentation in
place" list now records the paper review with the 2026-05-17 date;
"Outstanding for promotion" has the paper-review bullet removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
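The SciPy distinction above is worth grounding: the NIS box probability itself needs no tmvtnorm-equivalent. A sketch for two pre-periods using plain inclusion-exclusion over `multivariate_normal.cdf` (all numbers illustrative; the truncated-MVN *moments* for Propositions 1 and 3 would still need a separate routine):

```python
import numpy as np
from scipy import stats

# P(beta_hat_pre in B_NIS) under beta_hat_pre ~ N(delta_pre, Sigma_22),
# i.e. the probability the pretest PASSES under a hypothesized violation.
Sigma_22 = np.array([[0.04, 0.012],
                     [0.012, 0.09]])
delta_pre = np.array([0.05, 0.10])
z = stats.norm.ppf(0.975)
b = z * np.sqrt(np.diag(Sigma_22))     # per-period NIS half-widths
a = -b

F = lambda x: stats.multivariate_normal.cdf(x, mean=delta_pre, cov=Sigma_22)
# 2-D box probability by inclusion-exclusion: P(a < X < b).
pass_prob = F([b[0], b[1]]) - F([a[0], b[1]]) - F([b[0], a[1]]) + F([a[0], a[1]])
print(0.0 < pass_prob < 1.0)           # True
```

One minus `pass_prob` is the pretest's power against `delta_pre`; the conditional bias and variance in Propositions 1 and 3 additionally require the moments of the normal truncated to that box, which is where the tmvtnorm-equivalent comes in.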
R13's REGISTRY pointer pulled the new audit's diag-fallback finding into
authoritative scope; R14 flagged that the authoritative `## PreTrendsPower`
REGISTRY entry didn't yet document the deviation. Adding the inline
documentation per the CLAUDE.md "Documenting Deviations" convention.
- P1 REGISTRY missing `**Note (deviation from paper):**` for diag-VCV
fallback: added a Note inside the `## PreTrendsPower` Edge cases block
documenting all three result-type paths (MultiPeriodDiD: full sub-VCV
when interaction_indices populated, diag otherwise; CS: diag fallback
even with event_study_vcov available — deliberate; SA: diag fallback
forced — SunAbrahamResults has no event-study VCV surface). The Note
cross-references the paper review file and the new TODO entries.
- P2 `compute_pretrends_power(..., violation_type="custom")` is broken
today: added an explicit "**Helper/class API gap observed today**"
call-out in the Status block of the paper review, naming the missing
`violation_weights` parameter on the helper and clarifying that
`"custom"` is class-only today.
- P3 R-package version pin + diag-fallback follow-up: added three TODO.md
rows under "Methodology/Correctness":
* Diag fallback follow-up (Medium) — route CS through event_study_vcov +
extend SunAbrahamResults with event-study/cohort VCV, or formally
retain diag with miscalibration framing.
* R pretrends commit pin (Low) — record the audited revision before
building parity fixture.
* compute_pretrends_power "custom" helper gap (Low) — add
violation_weights to helper signature or document helper as supporting
only linear/constant/last_period.
All three TODO rows tagged "PR-A (Roth paper review, 2026-05-17)" for
provenance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R15 verdict was ✅ Looks good (5th overall ✅) — addressing remaining polish.
- P3 CS event_study_vcov availability stated too broadly: previous wording in
  REGISTRY.md, roth-2022-review.md, and TODO.md said CallawaySantAnnaResults
  "persists event_study_vcov", but bootstrap CS fits explicitly clear that
  matrix at staggered.py:2032-2036 (to prevent mixing analytical VCV with
  bootstrap SEs). Qualified all three surfaces to split non-bootstrap
  (persists event_study_vcov, diag fallback is a deliberate choice) vs
  bootstrap (clears event_study_vcov, diag fallback is the only path) CS
  paths.
- P3 "four power calculations" parity target underspecified: the
  PreTrendsPower row in METHODOLOGY_REVIEW.md said future R parity should
  cover "the four power calculations" without acknowledging that
  `compute_pretrends_power(..., violation_type="custom")` is unusable from
  the helper today (helper does not accept violation_weights). Added
  clarification that "custom" parity has to run through PreTrendsPower
  directly until the helper is extended; helper-only parity is limited to
  linear/constant/last_period.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R16 verdict was ✅ Looks good (6th overall ✅), with all findings classified
P3-informational. Two of the four findings ("None in this PR" / "TODO rows
are correct") required no action; addressing the two with concrete polish
suggestions.
- P3 helper docstring: `compute_pretrends_power` and `compute_mdv` both
accept `violation_type` but don't accept `violation_weights`, so
`violation_type="custom"` is unusable from either helper today. Added
an explicit Note in the `violation_type` docstring entry of both
convenience functions pointing users to instantiate `PreTrendsPower`
directly for custom weights, and cross-referencing the TODO row that
tracks the helper-extension follow-up.
- P3 METHODOLOGY_REVIEW.md "four power calculations" target: now reads
"at a pinned revision" with cross-reference to the TODO row that
tracks the R-package revision pin. Until that lands, the R-package
surface claims in the paper review remain provisional.
Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, ndarray-to-tuple coercion) are pre-existing in code I did not
touch; my edits are docstring-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R17 escalated to ⛔ Blocker after the agentic codex reviewer dug into
pretrends.py and surfaced two pre-existing library bugs that my R16 helper
docstring update inadvertently directed users toward:
- P0 verified: PreTrendsPowerResults.power_at() silently wrong for
  violation_type="custom". The result dataclass does not persist fitted
  violation_weights (pretrends.py:77-90), and power_at() falls back to equal
  weights with an inline "we can't reconstruct" comment at line 230-231. My
  R16 docstring change pointed users to the broken class path.
- P1 verified: linear violation pattern at _get_violation_weights()
  (pretrends.py:510-515) ignores actual pre-period relative-time labels and
  constructs a shifted [n-1,...,0] direction from n_pre count alone, so for
  irregular / anticipation-shifted grids the reported MDV is NOT in Roth's γ
  units.
Per user direction, keeping PR-A paper-review-scoped and deferring both bugs
to PR-B (the implementation audit):
- Added two High-priority TODO.md rows tagged "PR-A (surfaced by R17 of the
  iterative codex review)" with explicit line refs + concrete fix outlines
  (persist normalized fitted weights on PreTrendsPowerResults for the P0;
  thread relative-time labels through _get_violation_weights for the P1).
- Reverted the R16 helper-docstring recommendation that pointed custom users
  to instantiate PreTrendsPower(..., violation_weights=...). The helper now
  describes the helper-vs-class API gap AND the
  PreTrendsPowerResults.power_at() known issue, without recommending the
  broken class path. Both gaps cross-referenced to TODO.md.
Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, etc.) are pre-existing in code I did not touch; my edits are
docstring + TODO.md only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R18 rejected the R17 TODO-only deferral: "a TODO/docstring note does not
mitigate a silent statistical bug." Per user direction, applying the minimal
partial fix that codex's verdict explicitly offered as acceptable.
- P0 silent-failure guard on PreTrendsPowerResults.power_at: rather than
  doing the full weight-persistence refactor (deferred to PR-B), the method
  now raises NotImplementedError for violation_type="custom" and points users
  to refit with the new M instead. The previous equal-weights fallback (with
  the literal "we can't reconstruct" comment) is replaced with a fail-loud
  guard at the top of power_at; the inner else branch is retained as a
  defensive fallback for future violation_type values. Added a Raises section
  to the docstring documenting the guard. Added a regression test
  `test_power_at_raises_on_custom_violation_type` to lock the behavior in;
  full test_pretrends.py suite passes 64/64.
- P1 REGISTRY linear-violation deviation Note: added a `**Note (deviation
  from paper — linear violation pattern):**` block inside the
  `## PreTrendsPower` entry's Violation-types block describing how the
  shipped `[n_pre-1, ..., 1, 0]` direction is built from `n_pre` count alone
  (no relative-time labels), so irregular / anticipation-shifted grids do not
  yield Roth's γ-units MDV. Cross-referenced to TODO.md.
- Silent-failure-guard Note: also added a paired Note documenting the
  power_at custom raise behavior in REGISTRY.md for symmetry with the linear
  deviation Note (both label the corresponding TODO follow-up).
Pre-existing pyright diagnostics in pretrends.py are unrelated to this diff
(my edits are confined to power_at, docstrings, REGISTRY, and a new test
method).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
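The fail-loud guard pattern described above can be sketched in a few lines. This is a minimal illustration of the technique, not the shipped `diff_diff` code: the class name, constructor, and `power_at` body here are hypothetical stand-ins.

```python
class PowerResults:
    """Sketch of a result object whose power_at() guards against a
    violation type it cannot faithfully reconstruct (illustrative names)."""

    def __init__(self, violation_type, n_pre):
        self.violation_type = violation_type
        self.n_pre = n_pre

    def power_at(self, M):
        if self.violation_type == "custom":
            # Fitted weights are not persisted on the result object, so any
            # reconstruction here would silently use the wrong direction.
            # Fail loud instead of falling back to equal weights.
            raise NotImplementedError(
                "power_at() does not support violation_type='custom'; "
                "refit with the new M instead."
            )
        ...  # known violation types would reconstruct their weights here
```

The key property the regression test locks in is that the guard fires before any weight reconstruction, so a custom fit can never return a silently miscalibrated number.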
R19 verdict was ✅ Looks good — addressing all 3 cleanup items so the
documentation, tracking, and tests align with R18's actual mitigation.
- P2 helper docstrings stale: `compute_pretrends_power` and `compute_mdv`
docstrings still said `power_at()` "silently falls back to equal
weights" even though R18 changed the behavior to raise
`NotImplementedError`. Updated both docstrings to describe the actual
current behavior: helper raises `ValueError` from the underlying
constructor for `violation_type='custom'`; `PreTrendsPowerResults.
power_at()` on a custom class fit raises `NotImplementedError` because
fitted weights are not yet persisted.
- P3 TODO.md row stale wording: `TODO.md:100` row still described the
silent-failure path in present tense. Rewrote to record that the
silent-failure path was mitigated in R18 (locked by
test_power_at_raises_on_custom_violation_type) and narrowed the
remaining follow-up to the weight-persistence work needed to re-enable
`power_at()` for custom fits. Downgraded priority from High to Medium
since the silent-bug surface is closed.
- P3 helper-level regression test: added two new tests to
`TestConvenienceFunctions`:
* `test_compute_pretrends_power_rejects_custom_violation_type` —
confirms ValueError when `compute_pretrends_power(...,
violation_type='custom')` is called without weights.
* `test_compute_mdv_rejects_custom_violation_type` — mirrors the
contract for `compute_mdv`.
Full test_pretrends.py suite: 66/66 passing (was 64 + 2 new helper tests).
Pre-existing pyright diagnostics in pretrends.py + test_pretrends.py are
unrelated to this diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R20 verdict was ✅ Looks good. The two methodology + tech-debt
P3-informational findings explicitly required "none required for approval"
(already tracked via REGISTRY notes + TODO rows). Addressed the one finding
with a concrete fix:
- P3 maintainability — PreTrendsPowerResults.power_at() still had an
  unknown-violation_type else fallback that defaulted to equal weights
  (np.ones(n_pre)). That meant a future violation_type added to fit() but not
  threaded through power_at() would silently produce equal-weights output.
  Replaced with an explicit ValueError mirroring the raise at the end of
  _get_violation_weights() (line 530-531). Aligns the two
  weight-reconstruction paths and closes the future-drift silent surface.
Test suite: 66/66 passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall Assessment: ✅ Looks good
CI codex Round 1 verdict was ✅ Looks good. Addressing the two actionable P3
doc findings; the other 4 P3 findings were explicit "no fix required"
mitigation confirmations.
- P3 REPORTING.md parity: the diagonal-covariance-fallback Note still called
  the deviation "conservative" at REPORTING.md:324-337, but the new
  authoritative REGISTRY.md note (in this PR) describes it as a non-paper
  approximation that is NOT provably conservative (the direction of the
  discrepancy with full Σ_22 depends on the sign + magnitude of the dropped
  correlations). Aligned the REPORTING note to match the REGISTRY language,
  kept the description of the BusinessReport well_powered → moderately_powered
  downgrade as a practical safeguard (not a proof of conservatism), and
  cross-referenced the REGISTRY authoritative deviation block.
- P3 paper-review reproducibility: the roth-2022-review.md header listed
  "PDF reviewed: papers/roth-2022.pdf", but the /papers/ directory is
  gitignored, so the path isn't reproducible from the repo alone. Replaced
  the path with the published-article citation + DOI + an explicit note that
  the PDF was reviewed externally and that the /papers/ working directory is
  gitignored. Added pointers to the jonathandroth.com author page for
  reproduction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
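The "not provably conservative" claim in the parity fix above can be demonstrated with a small numeric sketch. This is illustrative code, not the library's implementation: the weight vector, standard errors, and correlation values are made up, and the point is only that dropping off-diagonal covariance terms can shrink or inflate a quadratic form `a' Σ a` depending on the sign of the dropped correlations.

```python
import numpy as np

a = np.array([1.0, 1.0])      # hypothetical weights on two pre-period coefficients
ses = np.array([0.5, 0.5])    # hypothetical standard errors (diagonal is held fixed)

def contrast_var(rho: float) -> float:
    """Variance a' Sigma a of the weighted contrast under correlation rho."""
    cov = rho * ses[0] * ses[1]
    sigma = np.array([[ses[0] ** 2, cov],
                      [cov, ses[1] ** 2]])
    return float(a @ sigma @ a)

var_diag = float(a @ np.diag(ses ** 2) @ a)  # diagonal fallback: 0.25 + 0.25 = 0.5
var_pos = contrast_var(+0.6)                 # positive correlation: 0.5 + 2*0.15 = 0.8
var_neg = contrast_var(-0.6)                 # negative correlation: 0.5 - 2*0.15 = 0.2

# The diag(ses^2) fallback understates the variance under positive correlation
# and overstates it under negative correlation -- neither direction is
# guaranteed, so the approximation is not provably conservative.
assert var_neg < var_diag < var_pos
```

Since the direction of the error flips with the sign of the dropped correlation, no blanket "conservative" label applies; that is exactly why the REPORTING.md wording had to be aligned with the REGISTRY language.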
🔁 AI review rerun (requested by @igerber) Head SHA: — Overall Assessment: Looks good
Summary
- `docs/methodology/papers/roth-2022-review.md` (298 lines) — methodology audit for the PreTrendsPower estimator, with the registry-candidate block holistically split into `### Paper-derived requirements` + `### Library design choices` subsections and explicit boundary markers.
- `REGISTRY.md` `## PreTrendsPower`: 3 new **Note:** blocks documenting deviations surfaced by the audit — diag-Σ_22 fallback on CS/SA adapters, count-based `linear` weights (not Roth's δ_t = γ·t for irregular grids), and the new `power_at(custom)` silent-failure guard.
- `diff_diff/pretrends.py`: `PreTrendsPowerResults.power_at()` now raises `NotImplementedError` for `violation_type="custom"` instead of silently substituting equal weights; the unknown-violation_type else branch is hardened to raise `ValueError` (mirrors `_get_violation_weights`).
- `METHODOLOGY_REVIEW.md`: PreTrendsPower row records paper review on file (2026-05-17); promotion checklist updated.
- `TODO.md`: 5 new rows under Methodology/Correctness tracking deferred follow-ups (Σ_22 fidelity, R-package revision pin, helper custom support, weight persistence, linear-units alignment).
- Tests: `test_power_at_raises_on_custom_violation_type` + `test_compute_pretrends_power_rejects_custom_violation_type` + `test_compute_mdv_rejects_custom_violation_type` (93/93 passing across `test_pretrends.py` + `test_pretrends_event_study.py`).

Methodology references (required if estimator / math changes)
- `PreTrendsPowerResults.power_at()` (silent-failure guard).
- `## PreTrendsPower` deviation notes:
  - diag(ses^2) rather than full pre-period Σ_22 for staggered result types (CS deliberate where `event_study_vcov` is available; SA forced because `SunAbrahamResults` does not expose an event-study VCV surface). Not provably conservative; tracked in TODO.md for the PR-B audit.
  - `linear` violation pattern: `_get_violation_weights("linear")` constructs `[n_pre-1, ..., 1, 0]` from `n_pre` alone — it does not honor actual relative-time labels, so reported MDV is not in Roth's γ units for irregular / anticipation-shifted grids. Tracked in TODO.md.
  - `power_at(custom)` silent-failure guard: `compute_pretrends_power` / `compute_mdv` reject `violation_type="custom"` (no `violation_weights` parameter); the class-direct path works at fit time, but `PreTrendsPowerResults.power_at()` now raises `NotImplementedError` for custom fits (was a silent equal-weights fallback at `pretrends.py:230-231`). Tracked in TODO.md for weight-persistence follow-up.

Validation
- `tests/test_pretrends.py` — 3 new regression tests as listed above. Full `test_pretrends.py` + `test_pretrends_event_study.py`: 93/93 passing.

Security / privacy
🤖 Generated with Claude Code
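The count-based `linear` weights deviation noted in the summary can be illustrated with a short sketch. This is hypothetical demonstration code, not the library's implementation: `count_based_weights` mimics the `[n_pre-1, ..., 1, 0]` construction described above, while `label_based_weights` is an invented alternative that measures distance in actual relative-time units; the point is only that the two agree on a consecutive grid and diverge on an irregular one.

```python
import numpy as np

def count_based_weights(n_pre: int) -> np.ndarray:
    """Weights built from the pre-period count alone: [n_pre-1, ..., 1, 0]."""
    return np.arange(n_pre - 1, -1, -1, dtype=float)

def label_based_weights(rel_times: np.ndarray) -> np.ndarray:
    """Hypothetical alternative: distance from the last pre-period,
    measured in the units of the actual relative-time labels."""
    return (rel_times.max() - rel_times).astype(float)

regular = np.array([-3, -2, -1])    # consecutive pre-period grid
irregular = np.array([-6, -2, -1])  # a gap in the pre-period grid

# On a consecutive grid the two constructions coincide ([2, 1, 0]) ...
assert np.allclose(count_based_weights(3), label_based_weights(regular))
# ... but on an irregular grid the count-based weights ([2, 1, 0]) no longer
# track the labels ([5, 1, 0]), so a slope reported against them is not in
# per-period (gamma) units.
assert not np.allclose(count_based_weights(3), label_based_weights(irregular))
```

This is the linear-units alignment follow-up tracked in TODO.md: whichever convention PR-B adopts, the MDV units should be stated in terms of the actual relative-time grid.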